In todays session we will continue to review the Myc ChIPseq we were working on in our last session.
This include Myc ChIP-seq for MEL and Ch12 celllines as well as their input controls.
Information and files for the Myc ChIPseq in MEL cell line can be found here
Information and files for the Myc ChIPseq in Ch12 cell line can be found here
Input control can be found for MEL cell line can be found here
Input control can be found for Ch12 cell line can be found here.
For some discussions:
The ChIPQC package wraps some of the metrics into a Bioconductor package and takes care to measure these metrics under the appropriate condition.
To run a single sample we can use the ChIPQCsample() function, the relevant unfiltered BAM file and we are recommended to supply a blacklist as a BED file or GRanges and Genome name.
You can find a Blacklist for most genomes at Anshul Kundaje’s site or directly from the Encode websites
QCresult <- ChIPQCsample(reads="/pathTo/myChIPreads.bam",
genome="mm10",
blacklist = "/pathTo/mm10_Blacklist.bed")We can display our ChIPQCsample object which will show a basic summary of our ChIP-seq quality.
chipqc_MycMel_rep1## ChIPQCsample
## Number of Mapped reads: 17434045
## Number of Mapped reads passing MapQ filter: 14027340
## Percentage Of Reads as Non-Duplicates (NRF): 100(0)
## Percentage Of Reads in Blacklisted Regions: 6
## SSD: 2.04563248560836
## Fragment Length Cross-Coverage: 0.00240586015883558
## Relative Cross-Coverage: 1.36747914104688
## Percentage Of Reads in GenomicFeature:
## ProportionOfCounts
## BlackList 0.06945900
## LongPromoter20000to2000 0.25996440
## Promoters2000to500 0.04602540
## Promoters500 0.04435780
## All5utrs 0.02186480
## Alltranscripts 0.56982008
## Allcds 0.02862382
## Allintrons 0.51979606
## All3utrs 0.02149253
## Percentage Of Reads in Peaks: NA
## Number of Peaks: 0
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
All ChIPQC functions can work with a named list of ChIPQCsample objects to aggregate scores into table as well as plots.
Here we use the QCmetrics() function to ive an overview of quality metrics.
QCmetrics(myQC)## Reads Map% Filt% Dup% ReadL FragL RelCC SSD RiP%
## Sorted_Myc_Ch12_1.bam 10993821 100 21.9 0 35 186 1.070 3.82 NA
## Sorted_Myc_Ch12_2.bam 10060460 100 18.4 0 35 146 1.310 2.84 NA
## Sorted_Myc_MEL_1.bam 10111080 100 20.8 0 35 177 1.220 3.42 NA
## Sorted_Myc_MEL_2.bam 10686984 100 23.4 0 35 177 1.050 4.20 NA
## Sorted_Input_MEL.bam 8591703 100 23.6 0 36 184 0.620 4.51 NA
## Sorted_Input_Ch12.bam 10429244 100 23.7 0 35 182 0.275 4.26 NA
## RiBL%
## Sorted_Myc_Ch12_1.bam 12.8
## Sorted_Myc_Ch12_2.bam 10.1
## Sorted_Myc_MEL_1.bam 11.9
## Sorted_Myc_MEL_2.bam 14.0
## Sorted_Input_MEL.bam 15.9
## Sorted_Input_Ch12.bam 12.8
The prediction of fragment length is an essential part of ChIP-seq affecting peaks calling, summit identification and coverage profiles.
The use of cross-correlation or cross-coverage allows for an assessment of reads clustering by strand and so a measure of quality.
In ChIP-seq typically short single end reads of dsDNA.
3’ of fragment end will be on “-” strand.
“-” reads only in negative
The plotCC function can be used to plot our cross-coverage profiles
The plotCC() function accepts our list of ChIPQC sample objects and a facetBy argument to allow us to group our cross-coverage profiles.
plotCC(myQC,facetBy = "Sample")We can include additional metadata to allow us to group our plot in different ways.
We can include metadata as a data.frame where the first column is our sample names.
myMeta <- data.frame(Sample= names(myQC),
Tissue=c("Ch12","Ch12","MEL","MEL","MEL","Ch12"),
Antibody=c(rep("Myc",4),rep("Input",2)))
myMeta## Sample Tissue Antibody
## 1 Sorted_Myc_Ch12_1.bam Ch12 Myc
## 2 Sorted_Myc_Ch12_2.bam Ch12 Myc
## 3 Sorted_Myc_MEL_1.bam MEL Myc
## 4 Sorted_Myc_MEL_2.bam MEL Myc
## 5 Sorted_Input_MEL.bam MEL Input
## 6 Sorted_Input_Ch12.bam Ch12 Input
All plots in ChIPQC are in fact built in ggplot2 so we can edit and update our plot like all ggplot objects.
plotCC(myQC,facetBy = "Tissue",addMetaData = myMeta,
colourBy="Antibody")+theme_bw()+
ggtitle("ChIPQC results")Since SSD score is strongly affected by blacklisting it may be necessary to change the axis to see any differences between samples for post-blacklisting scores.
Higher post-blacklisted SSD scores reflect samples with stronger peak signal.
plotSSD(myQC)+xlim(0.2,0.4)We should compare our duplication rate across samples to identify any sample experiencing overamplification and so potential of a lower complexity.
The flagtagcounts() function reports can report the number of duplicates and total mapped reads and so from there we can calculate our duplication rate.
myFlags <- flagtagcounts(myQC)
myFlags["DuplicateByChIPQC",]/myFlags["Mapped",]## Sorted_Myc_Ch12_1.bam Sorted_Myc_Ch12_2.bam Sorted_Myc_MEL_1.bam
## 0.07203883 0.08633293 0.15987076
## Sorted_Myc_MEL_2.bam Sorted_Input_MEL.bam Sorted_Input_Ch12.bam
## 0.06253850 0.14269010 0.13873057
ptesting
Set bioconda channels accroding to the construction
Install packages e.g. MACS2
We can still run our MACS2 from the comfort of R using the system() function.
We simply need to build our MACS2 peak call command in R and pass to the system() function.
myChIP <- "Sorted_Myc_MEL_1.bam"
myControl <- "Sorted_Input_MEL.bam"
macsCommand <- paste0("macs2 callpeak -t ", myChIP,
" -n ", "Mel_Rep1",
" –-outdir ","PeakDirectory",
" -c ", myControl)
system(macsCommand)In addition to the genomic coordinates of peaks, these files contain useful information on the samples, parameters and version used for peak calling at the top.
## [1] "# Command line: callpeak -t Sorted_Myc_MEL_1.bam -n Mel1 -c Sorted_Input_MEL.bam"
## [2] "# ARGUMENTS LIST:"
## [3] "# name = Mel1"
## [4] "# format = AUTO"
## [5] "# ChIP-seq file = ['Sorted_Myc_MEL_1.bam']"
## [6] "# control file = ['Sorted_Input_MEL.bam']"
## [7] "# effective genome size = 2.70e+09"
## [8] "# band width = 300"
We can import peak files therefore using read.delim function. Note we have set comment.char argument to # to exclude additional information on peak calling parameters stored within the MACS peak file.
macsPeaks_DF <- read.delim(macsPeaks,comment.char="#")
macsPeaks_DF[1:2,]## chr start end length abs_summit pileup X.log10.pvalue.
## 1 chr1 4785371 4785642 272 4785563 20.89 10.66553
## 2 chr1 5082993 5083247 255 5083123 33.42 12.68072
## fold_enrichment X.log10.qvalue. name
## 1 5.33590 7.37727 Mel1_peak_1
## 2 4.30257 9.27344 Mel1_peak_2
As we have seen before elements in GRanges can accessed and set using various GRanges functions. Here we can deconstruct our object back to contig names and interval ranges.
seqnames(macsPeaks_GR)## factor-Rle of length 16757 with 21 runs
## Lengths: 916 822 1795 582 596 ... 1089 836 954 395 7
## Values : chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
## Levels(21): chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
ranges(macsPeaks_GR)## IRanges object with 16757 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 4785371 4785642 272
## [2] 5082993 5083247 255
## [3] 7397544 7398115 572
## [4] 7616290 7616433 144
## [5] 8134747 8134893 147
## ... ... ... ...
## [16753] 2657144 2657294 151
## [16754] 90784142 90784289 148
## [16755] 90818471 90818771 301
## [16756] 90824549 90824905 357
## [16757] 90825407 90825575 169
We will want to remove any peaks overlapping blacklisted regions prior to any downstream analysis. We can do this using simple overlapping with GRanges objects.
library(rtracklayer)
blkList <- import.bed("mm10-blacklist.bed")
macsPeaks_GR <- macsPeaks_GR[!macsPeaks_GR %over% blkList] So far we have been working with ChIP-seq peaks corresponding to transcription factor binding. Transcription factors, as implied in the name, can affect the expression of their target genes.
The target of transcription factor is hard to assertain from ChIP-seq data alone and so often we will annotate peaks to genes by a simple set of rules:-
Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance.)
A useful package for annotation of peaks to genes is ChIPseeker.
By using pre-defined annotation in the from of a TXDB object for mouse (mm9 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites.
First load the libraries we require for the next part.
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(GenomeInfoDb)
library(ChIPseeker)The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from.
peakAnno <- annotatePeak(macsPeaks_GR, tssRegion=c(-500, 500),
TxDb=TxDb.Mmusculus.UCSC.mm10.knownGene,
annoDb="org.Mm.eg.db")## >> preparing features information... 2018-05-11 12:07:31
## >> identifying nearest features... 2018-05-11 12:07:31
## >> calculating distance from peak to TSS... 2018-05-11 12:07:32
## >> assigning genomic annotation... 2018-05-11 12:07:32
## >> adding gene annotation... 2018-05-11 12:07:34
## >> assigning chromosome lengths 2018-05-11 12:07:34
## >> done... 2018-05-11 12:07:34
class(peakAnno)## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"
The csAnno object contains the information on annotation of individual peaks to genes.
To extract this from the csAnno object the ChIPseeker functions as.GRanges or as.data.frame can be used to produce the respective object with peaks and their associated genes.
peakAnno_GR <- as.GRanges(peakAnno)
peakAnno_DF <- as.data.frame(peakAnno)Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker’s plotting functions to display distribution of peaks in gene features. Here we use the plotAnnoBar function to plot this as a bar chart but plotAnnoPie would produce a similar plot as a pie chart.
plotAnnoBar(peakAnno)ChIPseeker can also offer a succinct plot to describe the overlap between annotations.
upsetplot(peakAnno, vennpie = F)